18 results
3 - Families of Dissimilarity between Nodes
- François Fouss (Université Catholique de Louvain, Belgium), Marco Saerens (Université Catholique de Louvain, Belgium), Masashi Shimbo
- Book: Algorithms and Models for Network Data and Link Analysis
- Published online: 05 July 2016
- Print publication: 12 July 2016, pp 102-142
- Chapter
Summary
Introduction
This chapter is a follow-up to the previous chapter. It presents more advanced material involving recent attempts to define useful distances and similarities between nodes of a graph. While meaningful in many contexts and popular, the shortest-path distance does not convey information about the degree of connectivity between the nodes. On some occasions, we would like a distance that also captures information about their connection rate, with high connectivity being taken as an indication that the two nodes are close in some sense (e.g., they can easily exchange information). In other words, the presence of many indirect paths (as opposed to direct links) between nodes also suggests some kind of proximity between them.
As seen in the previous chapter, the resistance distance and the commute-time distance capture this property. However, we also saw that these quantities suffer from the fact that, when the graph becomes larger, they converge to a meaningless limit function (see [790, 792] or very recently [370], and the previous chapter, Section 2.5.3). This effect was called “being lost in space” in [790] and is related to the fact that a simple random walk mixes before hitting its target [370].
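The contrast between the two distances can be made concrete with a small numerical sketch. The code below is an illustration, not the book's algorithm: it assumes an undirected, connected graph given as a symmetric adjacency matrix, and uses the standard identity that the resistance distance between i and j is l+_ii + l+_jj - 2 l+_ij, where L+ is the pseudoinverse of the graph Laplacian. The function names and the test graph are hypothetical.

```python
import numpy as np

def resistance_distance(A):
    """All-pairs resistance distance for a symmetric adjacency matrix A
    (assumes the graph is undirected and connected)."""
    A = np.asarray(A, dtype=float)
    L = np.diag(A.sum(axis=1)) - A      # graph Laplacian
    Lp = np.linalg.pinv(L)              # Moore-Penrose pseudoinverse
    d = np.diag(Lp)
    return d[:, None] + d[None, :] - 2 * Lp

def parallel_paths_graph(k):
    """Hypothetical test graph: nodes 0 and 1 joined by k disjoint
    two-hop paths through intermediate nodes 2, ..., k+1."""
    n = k + 2
    A = np.zeros((n, n))
    for m in range(2, n):
        A[0, m] = A[m, 0] = 1.0
        A[1, m] = A[m, 1] = 1.0
    return A

# The shortest-path distance between nodes 0 and 1 is 2 for every k,
# while the resistance distance shrinks as more paths are added.
resistances = {k: resistance_distance(parallel_paths_graph(k))[0, 1]
               for k in (1, 2, 4)}
```

Each two-hop path behaves like two unit resistors in series, and the k paths act in parallel, so the resistance distance is 2/k. The commute-time distance is this quantity multiplied by the volume of the graph.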
This means that both the shortest-path distance and the (Euclidean) commute-time distance have some inconvenient flaws, at least in the case of large graphs, and depending on the application. In some sense, they can be considered as two extremes of a continuum: one end considers only path length, while the other considers only connectivity (without regard to length).
In this context, several researchers recently proposed to work with parametric dissimilarities or distances interpolating between the shortest-path distance and the commute-time distance [20, 155, 157, 292, 459, 833]. They all depend on a continuous parameter and therefore define “families of distances.” At one limit of the value of the parameter, these quantities converge to the shortest-path distance while at the other end, they converge to the commute-time distance. They therefore “interpolate” between the two distances. The idea is that when the parametric, interpolated, distance is not too far from the shortest-path distance, it integrates the degree of connectivity between the nodes into the distance while not being too sensitive to the effect of “being lost in space” [790].
Index
- François Fouss (Université Catholique de Louvain, Belgium), Marco Saerens (Université Catholique de Louvain, Belgium), Masashi Shimbo
- Book: Algorithms and Models for Network Data and Link Analysis
- Published online: 05 July 2016
- Print publication: 12 July 2016, pp 515-521
- Chapter
6 - Labeling Nodes: Within-Network Classification
- François Fouss (Université Catholique de Louvain, Belgium), Marco Saerens (Université Catholique de Louvain, Belgium), Masashi Shimbo
- Book: Algorithms and Models for Network Data and Link Analysis
- Published online: 05 July 2016
- Print publication: 12 July 2016, pp 235-275
- Chapter
Summary
Introduction
This chapter introduces some techniques to assign a class label to an unlabeled node, based on the knowledge of the class of some labeled nodes as well as the graph structure. This is a form of the task known as supervised classification in the machine learning and pattern recognition communities. Consider for example the case of a patents network [554] where each patent is a node and there is a directed link between two patents i and j if i cites j. In addition to the resulting graph structure, some information related to the nodes could be available, for instance, the industrial area of the patent (chemicals, information and communication technologies, drugs and medicals, electrical and electronics, etc.). Assume that the industrial area is known for some patents (labeled nodes) but not yet known for some other nodes (unlabeled nodes). The within-network classification or node classification task [89] aims to infer the label of the unlabeled nodes from the labeled ones and the graph structure.
As discussed in [553], within-network classification falls into the semisupervised classification paradigm [2, 10, 152, 844, 847]. The goal of semisupervised classification is to learn a predictive function using a small amount of labeled samples together with a (usually large) amount of unlabeled samples, the labels being missing or unobserved for these samples. Semisupervised learning tries to combine these two sources of information (labeled + unlabeled data) to build a predictive model in a better way than simply using the labeled samples alone, and thus simply ignoring the unlabeled samples. Indeed, in general, labeled data are expensive (think, for example, about an expert who has to label the cases manually), whereas unlabeled data are ubiquitous, for example, web pages. Hence, trying to exploit the distribution of unlabeled data during the estimation process can prove helpful. Among popular semisupervised algorithms, we find co-training, expectation-maximization algorithms, transductive inference, and so on – for a comprehensive survey of the topic see, for example, [844, 847].
However, to be effective, semisupervised learning algorithms on a graph rely on some strong assumptions about the distribution of these labels. The main assumption is that neighboring nodes are likely to belong to the same class and thus to share the same class label.
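The neighborhood assumption can be illustrated with a simple label-propagation sketch. This is one generic scheme chosen for illustration and not necessarily a method from the chapter: class scores of unlabeled nodes are repeatedly replaced by the average score of their neighbors, while labeled nodes stay clamped to their known class. The graph is assumed undirected, and the function name, the use of -1 for "unlabeled", and the iteration count are all assumptions.

```python
import numpy as np

def propagate_labels(A, labels, n_classes, n_iter=200):
    """Iterative neighbor averaging with clamped labeled nodes.
    A: symmetric adjacency matrix; labels: class index per node,
    with -1 marking an unlabeled node (hypothetical convention)."""
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    n = A.shape[0]
    labeled = labels >= 0
    Y = np.zeros((n, n_classes))
    Y[labeled, labels[labeled]] = 1.0          # one-hot known labels
    deg = A.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0                        # avoid division by zero
    F = Y.copy()
    for _ in range(n_iter):
        F = (A @ F) / deg                      # average over neighbors
        F[labeled] = Y[labeled]                # clamp the labeled nodes
    return F.argmax(axis=1)

# Two triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

# Only node 0 (class 0) and node 5 (class 1) are labeled.
predicted = propagate_labels(A, [0, -1, -1, -1, -1, 1], n_classes=2)
```

With one labeled node per triangle, each unlabeled node ends up with the class of the labeled node in its own triangle, which is what the assumption that neighbors share labels predicts.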
8 - Finding Dense Regions
- François Fouss (Université Catholique de Louvain, Belgium), Marco Saerens (Université Catholique de Louvain, Belgium), Masashi Shimbo
- Book: Algorithms and Models for Network Data and Link Analysis
- Published online: 05 July 2016
- Print publication: 12 July 2016, pp 349-389
- Chapter
Summary
Introduction
Besides clustering, which is the task of partitioning graph nodes into disjoint subsets, it could also be interesting to identify dense regions inside the graph G. In that case, we are not trying to find a partition of G but only some subsets of nodes that are highly interconnected.
Density is an important concept in graph analysis and has been proven to be of particular interest in various areas, such as social networks, biology, and the World Wide Web [543, 524, 277]. It can be defined in many ways based on various concepts (subgraph connectivity, cliques, cores, subgraph density, etc.), leading to various approaches, as described in this chapter.
We first investigate some well-known local density measures or indices. The aim of these local density indices is to provide a measure of the extent to which a local subset of nodes, centered on a particular node, is highly cohesive, that is, highly interconnected. In other words, these measures try to answer questions like, “Do friends of a node tend to be friends of one another?” or “Is the friend of my friend also my friend?”
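One common index that answers the "friends of friends" question is the local clustering coefficient: the fraction of pairs of neighbors of a node that are themselves linked. The sketch below assumes an undirected, unweighted graph without self-loops; the function name is illustrative and the chapter's own indices may be defined differently.

```python
import numpy as np

def local_clustering(A):
    """Local clustering coefficient of every node of an undirected,
    unweighted graph given by its adjacency matrix A."""
    A = (np.asarray(A) > 0).astype(float)
    np.fill_diagonal(A, 0.0)                 # ignore self-loops
    deg = A.sum(axis=1)
    # diag(A^3) counts closed walks of length 3, i.e. twice the
    # number of triangles passing through each node.
    triangles = np.diag(A @ A @ A) / 2.0
    pairs = deg * (deg - 1) / 2.0            # pairs of neighbors
    coeff = np.zeros_like(deg)
    mask = pairs > 0
    coeff[mask] = triangles[mask] / pairs[mask]
    return coeff

# Triangle 0-1-2 plus a pendant node 3 attached to node 0.
A = np.zeros((4, 4))
for i, j in [(0, 1), (0, 2), (1, 2), (0, 3)]:
    A[i, j] = A[j, i] = 1.0

coeff = local_clustering(A)
```

Node 0 has three neighbors of which only one pair (1, 2) is linked, giving 1/3; nodes 1 and 2 each have two neighbors that are linked, giving 1; node 3 has a single neighbor, and the coefficient is set to 0 by convention here.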
Then, some global measures, smoothing the local density over the network, are presented. These tend to be more robust with respect to local variations of the density.
Thereafter, a few bottom-up agglomerative methods are described. These techniques allow highly dense regions to be detected, by extending them gradually in a sequential way according to a greedy algorithm. These methods are also known as hierarchical clustering techniques in multivariate statistics, pattern recognition, data mining, and machine learning. They are very useful when exploring the structure of the network.
Finally, a heuristic for maximum clique detection is briefly described.
Basic Local Density Measures
Many measures of local density of a graph were proposed [123, 469, 608, 804]; only a few popular choices are discussed here. The aim of these local density indices is to provide a measure of the extent to which a local subset of nodes, sometimes centered on a particular node, is highly cohesive. Cohesiveness of a subgraph can be characterized in several distinct ways [708], one of the most intuitive being subgraph connectivity. Intuitively, if a subgraph is cohesive, it should be possible to remove some of its nodes without disconnecting the subgraph.
4 - Centrality Measures on Nodes and Edges
- François Fouss (Université Catholique de Louvain, Belgium), Marco Saerens (Université Catholique de Louvain, Belgium), Masashi Shimbo
- Book: Algorithms and Models for Network Data and Link Analysis
- Published online: 05 July 2016
- Print publication: 12 July 2016, pp 143-200
- Chapter
Summary
Introduction
A large number of different centrality measures have been defined in the fields of social science, physics, computer sciences, and so on. By exploiting the structure of a graph, these quantities assign a score to each node of the graph G to reflect the extent to which this node is “central” with respect to G or a subgraph of G, that is, with respect to the communication flow between nodes construed in a broad sense. Centrality measures tend to answer the following questions: What is the most representative, or central, node within a given community? How critical is a given node with respect to information flow in a network? Which node is the most peripheral in a social network? Centrality scores attempt to tackle these problems by modeling and quantifying these different, vague, properties of nodes.
In general, these centrality measures are computed on undirected graphs or, when dealing with a directed graph, by ignoring the direction of the edges. They are therefore called “undirectional” [804]. Measures defined on directed graphs – and which are therefore directional – are often called importance or prestige measures, and are discussed in the next chapter. They capture the extent to which a node is “important,” “prominent,” or “prestigious” with respect to the entire directed graph by considering the directed edges as representing some kind of endorsement. Therefore, in this chapter, unless otherwise stated, all networks are considered to be undirected.
As discussed in [469], several attempts have been made to define a typology of centrality measures according to various criteria – for instance a node's involvement in the walk structure of a network; see, for example, [111, 123, 294] for details. In this chapter, only some of the most popular measures are described. For a more detailed account, see, for example, [105, 111, 804].
More precisely, three types of centrality measures are discussed in this chapter:
▸ closeness centrality, quantifying the extent to which a node, or a group of nodes, is central to a given network, that is, its proximity to other nodes in the graph
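As an illustration of closeness centrality, the sketch below uses one widely used definition: the number of other reachable nodes divided by the sum of shortest-path distances to them. It assumes an undirected, unweighted graph stored as a dictionary of neighbor lists; the normalization and the function name are assumptions and may differ from the chapter's formulation.

```python
from collections import deque

def closeness_centrality(adj):
    """Closeness of every node: (number of other reachable nodes)
    divided by the sum of shortest-path distances to them.
    adj maps each node to an iterable of its neighbors."""
    scores = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:                      # breadth-first search
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total = sum(dist.values())
        scores[source] = (len(dist) - 1) / total if total > 0 else 0.0
    return scores

# Path graph 0 - 1 - 2: the middle node is closest to the others.
path = {0: [1], 1: [0, 2], 2: [1]}
scores = closeness_centrality(path)
```

On this path graph the middle node is at distance 1 from both others and scores 2/2 = 1, while each end node scores 2/3.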
Contents
- François Fouss (Université Catholique de Louvain, Belgium), Marco Saerens (Université Catholique de Louvain, Belgium), Masashi Shimbo
- Book: Algorithms and Models for Network Data and Link Analysis
- Published online: 05 July 2016
- Print publication: 12 July 2016, pp v-xii
- Chapter
List of Symbols and Notation
- François Fouss (Université Catholique de Louvain, Belgium), Marco Saerens (Université Catholique de Louvain, Belgium), Masashi Shimbo
- Book: Algorithms and Models for Network Data and Link Analysis
- Published online: 05 July 2016
- Print publication: 12 July 2016, pp xvii-xxii
- Chapter
Frontmatter
- François Fouss (Université Catholique de Louvain, Belgium), Marco Saerens (Université Catholique de Louvain, Belgium), Masashi Shimbo
- Book: Algorithms and Models for Network Data and Link Analysis
- Published online: 05 July 2016
- Print publication: 12 July 2016, pp i-iv
- Chapter
List of Algorithms
- François Fouss (Université Catholique de Louvain, Belgium), Marco Saerens (Université Catholique de Louvain, Belgium), Masashi Shimbo
- Book: Algorithms and Models for Network Data and Link Analysis
- Published online: 05 July 2016
- Print publication: 12 July 2016, pp xiii-xvi
- Chapter
Bibliography
- François Fouss (Université Catholique de Louvain, Belgium), Marco Saerens (Université Catholique de Louvain, Belgium), Masashi Shimbo
- Book: Algorithms and Models for Network Data and Link Analysis
- Published online: 05 July 2016
- Print publication: 12 July 2016, pp 479-514
- Chapter
7 - Clustering Nodes
- François Fouss (Université Catholique de Louvain, Belgium), Marco Saerens (Université Catholique de Louvain, Belgium), Masashi Shimbo
- Book: Algorithms and Models for Network Data and Link Analysis
- Published online: 05 July 2016
- Print publication: 12 July 2016, pp 276-348
- Chapter
Summary
Introduction
This chapter introduces several methods of clustering the nodes of a graph into a partition. In multivariate statistics and data analysis [413, 429, 560], pattern recognition [418, 761, 807], data mining [361, 372], or machine learning [23, 91], clustering means grouping a set of objects into subsets, or clusters, such that those belonging to the same cluster are more “related” than those belonging to different clusters. In other words, a clustering provides a partition of the set of objects into disjoint clusters such that members of a cluster are highly “similar” while objects belonging to different clusters are dissimilar [264, 303, 418, 821, 824]. Of course, this supposes three different ingredients:
▸ a measure of similarity or dissimilarity between the objects
▸ a criterion, also called cost, loss, or objective function, measuring the quality of a partition
▸ an optimization technique, or procedure, for computing a high-quality partition, according to the criterion being considered
The similarity measure could, for instance, be the similarity provided by a kernel on a graph, or simply whether the nodes are connected. In addition, the criterion could be the total within-cluster inertia induced by the kernel on a graph in the embedding space, as in the case of a simple k-means clustering.
However, most clustering algorithms, such as k-means, assume that the user provides the number of clusters a priori, which is not very realistic because this number is, in general, not known in advance. There exist, however, a number of heuristic procedures to suggest a “natural” number of clusters (see for instance [576]). Moreover, some clustering algorithms do not need this assumption and are able to detect the number of clusters as well. These are often called community detection algorithms in the context of node clustering. One popular example of a community detection algorithm is modularity optimization, which is described in this chapter.
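To give a feel for the criterion behind modularity optimization, the sketch below evaluates the standard Newman-Girvan modularity of a given partition of an undirected graph, Q = (1/2m) Σ_ij (a_ij - k_i k_j / 2m) δ(c_i, c_j), where k_i is the degree of node i and m the number of edges. It only scores a partition and does not optimize it; the function name and the example graph are illustrative, and the chapter's own formulation may differ in notation.

```python
import numpy as np

def modularity(A, communities):
    """Newman-Girvan modularity of a partition of an undirected graph.
    A: symmetric adjacency matrix; communities: community label per node."""
    A = np.asarray(A, dtype=float)
    k = A.sum(axis=1)                        # node degrees
    two_m = k.sum()                          # twice the number of edges
    c = np.asarray(communities)
    same = c[:, None] == c[None, :]          # delta(c_i, c_j)
    B = A - np.outer(k, k) / two_m           # observed minus expected links
    return float((B * same).sum() / two_m)

# Two triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

q_two_triangles = modularity(A, [0, 0, 0, 1, 1, 1])
q_single_cluster = modularity(A, [0, 0, 0, 0, 0, 0])
```

Splitting the graph into its two triangles gives Q = 5/14, whereas placing all nodes in a single cluster gives Q = 0, so the criterion itself prefers two communities without the number being supplied in advance.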
There exist several different types of clustering algorithms [6, 264, 303, 418, 761, 821, 824], the most prominent ones being the following:
▸ Top-down, divisive, techniques, also called partitioning or splitting methods. These methods start from an initial situation where all the nodes of the graph are contained in only one cluster.
5 - Identifying Prestigious Nodes
- François Fouss (Université Catholique de Louvain, Belgium), Marco Saerens (Université Catholique de Louvain, Belgium), Masashi Shimbo
- Book: Algorithms and Models for Network Data and Link Analysis
- Published online: 05 July 2016
- Print publication: 12 July 2016, pp 201-234
- Chapter
Summary
Introduction
Many different measures of prestige and centrality of a node have been defined in social science, computer science, physics, and applied mathematics. Some authors call these measures “importance,” “standing,” “prominence,” or “popularity,” especially in the case of social networks. In this book, when the graph is directed, we speak about prestige and importance (they are used interchangeably), whereas in the case of an undirected graph, the concept is called centrality.
We therefore assume in this chapter that the elements, or weights, a_ij, of the adjacency matrix can be interpreted as a volume of endorsement, faith, credit, or citation, from object i to object j – this could be the number of references from i to j, the degree of confidence i has in j, and so on. Moreover, the graph is assumed to be directed, leading to a nonsymmetric adjacency matrix.
In some situations, though, we encounter the case where the weights on the arcs represent the amount of “influence” or “dominance” a node i has on node j, instead of endorsement. In this situation, the graph containing the reversed directed links (and whose adjacency matrix is thus the transpose A^T) can be interpreted as a new graph whose links represent some kind of endorsement. Indeed, if the links i → j model a relation of the type “i influences j,” the reverse relation j → i can usually be interpreted as “j gives credit to i.” Thus, if a graph represents an influence relation, it usually suffices to transpose its adjacency matrix to recover an endorsement-like relation. This shows that we must of course be careful about the meaning of the relation between nodes defining G, which should be clearly defined and interpreted.
In summary, this chapter is concerned with prestige measures (i.e., scores, or ratings) quantifying the importance of a node in a directed graph whose edges carry some “endorsement” relation. In this context, the prestige of a node increases as it becomes the object of more positive citations or endorsements (incoming links) [804]. Numerous measures were developed in the social sciences, only the most popular ones being introduced in this chapter. For other such measures and more details about the discussed measures, the interested readers are advised to consult, for example, reference [804].
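The simplest score consistent with this description is the weighted indegree: the total endorsement a node receives, that is, the column sum of the adjacency matrix. The sketch below computes it and shows how an influence matrix is transposed first. The function name and the small example matrix are hypothetical, and indegree is used here only as an illustration of the endorsement convention.

```python
import numpy as np

def indegree_prestige(A):
    """Weighted indegree: total endorsement received by each node,
    where A[i, j] is the endorsement given by node i to node j."""
    return np.asarray(A, dtype=float).sum(axis=0)

# Endorsement graph: node 0 cites nodes 1 and 2 once each,
# and node 1 cites node 2 twice (hypothetical data).
A_endorse = np.array([[0, 1, 1],
                      [0, 0, 2],
                      [0, 0, 0]])
prestige = indegree_prestige(A_endorse)

# If the same data had been recorded as "i influences j", the matrix
# would be the transpose; transposing it back recovers endorsement.
A_influence = A_endorse.T
prestige_from_influence = indegree_prestige(A_influence.T)
```

Node 2 receives three units of endorsement and is the most prestigious node under this score; both ways of recording the relation give the same result once the influence matrix has been transposed.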
Preface
- François Fouss (Université Catholique de Louvain, Belgium), Marco Saerens (Université Catholique de Louvain, Belgium), Masashi Shimbo
- Book: Algorithms and Models for Network Data and Link Analysis
- Published online: 05 July 2016
- Print publication: 12 July 2016, pp xxiii-xxvi
- Chapter
Summary
The network science field. Since the start of the twenty-first century, network science, the field whose main goal is to analyze network data, has become more and more popular in various areas of science and technology [47]. This interest has grown in parallel with the popularity of large networks, especially online networks like the World Wide Web, where each node is a web page and hyperlinks can be viewed as edges linking the pages. Another obvious example is online social networks like Facebook, where nodes are persons and links are friendship relations.
Although networks have been studied for years in the fields of social network analysis, operations research, graph theory, and graph algorithmics, the wide availability of such network structures on the Internet clearly boosted the field in the late 1990s. Computer scientists, physicists, chemists, economists, statisticians, and applied mathematicians all started to analyze network data.
In computer science, the field was called link analysis, while in physics, it was more often known as network science, a term that is now used across most disciplines. Roughly speaking, link analysis and network science aim at analyzing and extracting information from complex relational data (observed relations between entities like people, web pages, etc.) and are considered, in physics, to be a subfield of complex systems. The book is dedicated to this subject.
Intended audience. We have written this book for upper-level undergraduate or graduate students, researchers, and practitioners involved, or simply interested, in network data analysis. The book is not, however, intended as an introduction to network science. We assume that the reader has already followed an introductory course on graphs and networks (e.g., [47, 258, 468, 522, 608, 781] or the chapters dedicated to network data in [836]) as well as elementary courses in computer science, probability, statistics, and matrix theory. We nevertheless start with an introductory chapter, “Preliminaries and Notation,” summarizing the necessary slightly more advanced material and the notation.
While the material of the book is oriented toward computer scientists and engineers, we think that it should also attract students, researchers, and practitioners in other fields having an interest in network science. The material can easily be followed by other scientists in many application areas, provided they have the basic background knowledge outlined previously.
10 - Graph Embedding
- François Fouss (Université Catholique de Louvain, Belgium), Marco Saerens (Université Catholique de Louvain, Belgium), Masashi Shimbo
- Book: Algorithms and Models for Network Data and Link Analysis
- Published online: 05 July 2016
- Print publication: 12 July 2016, pp 437-478
- Chapter
Summary
Introduction
The general purpose of graph embedding is to associate a position or vector in a Euclidean space – usually of low dimensionality – to each node of the graph G. The Euclidean space in which the nodes are represented as points is called the embedding space. The points themselves are called node vectors, denoted {x_i}_{i=1}^n, and their coordinates are gathered in a data matrix X. This mapping thus corresponds to a configuration of the nodes in a Euclidean space preserving the structure of the graph as much as possible. For instance, a useful property of such a mapping would be that the neighbors of each node in G are also neighbors of the same nodes in the embedding space, according to the Euclidean distance in this space, and vice versa, that the neighbors of each node in the embedding space are also neighbors in the graph [512]. When the embedding space has dimension two or three, this technique is also called graph drawing and provides a layout of the graph that can be drawn.
Consider for instance the example of a social network defined by its adjacency matrix. It would be nice to have a three-dimensional drawing of this network in which we can navigate. This reduces to the computation of a configuration of the nodes in the three-dimensional Euclidean space preserving the structure of the graph together with some aesthetic properties [17] (for a sample of nice ways to draw a graph, see, e.g., [533]).
There is a vast literature on graph embedding and drawing. This chapter only presents a few methods, starting with some spectral methods which define the embedding according to eigenvectors of graph-related matrices. According to [473, 474], these techniques have two distinctive advantages:
▸ They provide a sound formulation minimizing a well-defined criterion, which almost always leads to an exact closed-form solution to the embedding problem.
▸ The solutions can be computed exactly, even for relatively large graphs while in other formulations (e.g., spring models or other physical models), the solution can usually only be approximated.
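A minimal sketch of one spectral embedding is given below, assuming an undirected, connected graph. It uses the eigenvectors of the unnormalized graph Laplacian associated with the smallest nonzero eigenvalues as node coordinates; this is only one of several spectral constructions and is not claimed to be the specific method of the chapter. The function name is illustrative.

```python
import numpy as np

def spectral_embedding(A, dim=2):
    """Embed the nodes of a connected undirected graph using the
    Laplacian eigenvectors with the smallest nonzero eigenvalues.
    Returns the data matrix X (one row per node, dim columns)."""
    A = np.asarray(A, dtype=float)
    L = np.diag(A.sum(axis=1)) - A           # unnormalized Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)     # eigenvalues in ascending order
    # Column 0 is the constant eigenvector (eigenvalue 0); skip it.
    return eigvecs[:, 1:dim + 1]

# Two triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

X = spectral_embedding(A, dim=2)
first_coordinate = X[:, 0]
```

On this example the first coordinate takes one sign on the nodes of the first triangle and the opposite sign on the second, so nodes that are neighbors in the graph land close together in the embedding space. The overall sign of each eigenvector is arbitrary.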
2 - Similarity/Proximity Measures between Nodes
- François Fouss (Université Catholique de Louvain, Belgium), Marco Saerens (Université Catholique de Louvain, Belgium), Masashi Shimbo
- Book: Algorithms and Models for Network Data and Link Analysis
- Published online: 05 July 2016
- Print publication: 12 July 2016, pp 59-101
- Chapter
Summary
Introduction
This chapter is concerned with the similarity and its dual, dissimilarity, between nodes of a graph. The need to quantify the similarity between objects arises in many situations, not only in network analysis. Indeed, similarity has been an important and widely used concept in many fields of research for years.
Having its origins in, among others, psychology in the work of Gustav Fechner of the 1860s, the concept of similarity has evolved over the years, as many similarity measures have been proposed in various fields such as feature contrast models [778], mutual information [384], cosine coefficients [289], and information content [666] (see [212] for a survey). The core idea behind a similarity measure is to exploit relevant information for determining the extent to which two objects are similar or not in some sense [212, 688, 761]. The simple intuitions behind the concept of similarity are summarized by Lin in [535]:
▸ The similarity between two objects is related to their commonality. The more commonality they share, the more similar they are.
▸ Symmetrically, the similarity between two objects is related to the differences between them. The more differences they have, the less similar they are.
▸ The maximum similarity between two objects is reached when the two objects are identical, no matter how much commonality they share.
Notice, however, that some popular similarity measures do not satisfy all of them. For instance, inner product similarity does not meet the third condition, unless it is normalized (in which case it is equivalent to cosine similarity).
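A two-vector example makes the failure of the third condition concrete. The vectors below are arbitrary illustrative feature vectors, not data from the book.

```python
import numpy as np

def inner_product(x, y):
    """Plain inner product similarity."""
    return float(np.dot(x, y))

def cosine(x, y):
    """Inner product normalized by the vector norms."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

x = np.array([1.0, 0.0])
y = np.array([3.0, 1.0])

# Inner product: x is "more similar" to y than to itself.
ip_self, ip_other = inner_product(x, x), inner_product(x, y)

# Cosine: the similarity of x with itself is the maximum value.
cos_self, cos_other = cosine(x, x), cosine(x, y)
```

Here the inner product of x with itself is 1 while its inner product with y is 3, so identity does not give the maximum. After normalization the self-similarity is 1 and the similarity with y is 3/√10, which is smaller.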
To measure the similarity between nodes of a graph, two complementary sources of information can be used:
▸ the features (or attributes) of the nodes, or
▸ the structure of the graph
The former refers to the fact that two nodes of the graph are considered to be similar if they share many common features, while the latter refers to the fact that two nodes of the graph are considered to be similar if they are “structurally close” in some sense in the network. Both kinds of information can be combined, of course.
9 - Bipartite Graph Analysis
- François Fouss (Université Catholique de Louvain, Belgium), Marco Saerens (Université Catholique de Louvain, Belgium), Masashi Shimbo
- Book: Algorithms and Models for Network Data and Link Analysis
- Published online: 05 July 2016
- Print publication: 12 July 2016, pp 390-436
- Chapter
Summary
Introduction
A bipartite graph (see, e.g., [9, 233, 432, 469, 810]), or bigraph for short, is a graph such that the node set may be partitioned into two disjoint sets, and each edge has one endpoint in the first set of nodes and the other in the second set of nodes (see Figure 9.1). In social sciences, such bipartite graphs are also called two-mode network data [804]. This type of network is very common and deserves special attention with specific techniques and models [101, 110, 482, 495].
The two sets of nodes will be called the left set (denoted as X or X-set) and the right set (denoted as Y or Y-set), respectively. We thus obviously have X ∩ Y = ∅, X ∪ Y = V, and each link i → j ∈ ℰ between nodes i and j has either i ∈ X, j ∈ Y or i ∈ Y, j ∈ X. The number of nodes in the left set (right set) is n_x = |X| (n_y = |Y|).
Bipartite graphs naturally appear in applications involving two types of objects, or objects playing different roles. Such applications include, among others, collaborative recommendation, reputation models, items rating, information retrieval, and matching problems. In the context of collaborative recommendation, the left set contains consumers while the right set contains items that were purchased by the consumers. In that situation, there is a link between a consumer i (left set) and an item j (right set) if and only if the consumer i purchased the item j. The weight of the link can be set, for example, to the number of times consumer i bought item j. In information retrieval, the left set contains documents while the right set contains words (the well-known document-term matrix). Each link is weighted, for example, by the number of times the term is contained in the document.
In many situations, bipartite graphs are related to the analysis of contingency tables [11, 92, 101], also called frequency tables or co-occurrence data. Indeed, in the case of an undirected weighted bipartite graph, if the weights can be interpreted as a number, or magnitude, of interactions between a node of the left set and a node of the right set, the data can also be viewed as a contingency table.
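The link between the bipartite graph and the contingency table can be sketched in a few lines. The code assumes an undirected weighted bipartite graph described by an n_x × n_y matrix W of interaction counts (consumers by items); this matrix is the contingency table, and the full adjacency matrix of the graph has W and its transpose as off-diagonal blocks. The function name and the purchase counts are hypothetical.

```python
import numpy as np

def bipartite_adjacency(W):
    """Adjacency matrix of the undirected bipartite graph whose
    left-to-right weights are given by the n_x x n_y matrix W.
    Left nodes come first, followed by the right nodes."""
    W = np.asarray(W, dtype=float)
    n_x, n_y = W.shape
    return np.block([[np.zeros((n_x, n_x)), W],
                     [W.T, np.zeros((n_y, n_y))]])

# Hypothetical purchase counts: 2 consumers (rows) x 3 items (columns).
W = np.array([[2, 0, 1],
              [0, 3, 0]])
A = bipartite_adjacency(W)

purchases_per_consumer = W.sum(axis=1)   # row margins of the table
purchases_per_item = W.sum(axis=0)       # column margins of the table
```

The two zero diagonal blocks express that no link joins two consumers or two items, and the row and column sums of W are the margins of the contingency table.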
1 - Preliminaries and Notation
- François Fouss (Université Catholique de Louvain, Belgium), Marco Saerens (Université Catholique de Louvain, Belgium), Masashi Shimbo
- Book: Algorithms and Models for Network Data and Link Analysis
- Published online: 05 July 2016
- Print publication: 12 July 2016, pp 1-58
- Chapter
Algorithms and Models for Network Data and Link Analysis
- François Fouss, Marco Saerens, Masashi Shimbo
- Published online: 05 July 2016
- Print publication: 12 July 2016
Network data are produced automatically by everyday interactions - social networks, power grids, and links between data sets are a few examples. Such data capture social and economic behavior in a form that can be analyzed using powerful computational tools. This book is a guide to both basic and advanced techniques and algorithms for extracting useful information from network data. The content is organized around 'tasks', grouping the algorithms needed to gather specific types of information and thus answer specific types of questions. Examples include similarity between nodes in a network, prestige or centrality of individual nodes, and dense regions or communities in a network. Algorithms are derived in detail and summarized in pseudo-code. The book is intended primarily for computer scientists, engineers, statisticians and physicists, but it is also accessible to network scientists based in the social sciences. MATLAB®/Octave code illustrating some of the algorithms will be available at: http://www.cambridge.org/9781107125773.